Zurich by the Numbers - Predictive Insights into Tourism Dynamics

Authors
Affiliation

Name I, First Name I

University of Lausanne

Name II, First Name II

Published

May 1, 2024

Abstract

This forecasting project applies time series techniques to predict tourism trends in Zurich. We combine historical data with forecasting models to provide actionable insights into future tourism patterns. The work covers data preparation, the exploration of several predictive models, and an evaluation of their forecasting accuracy. The goal is to turn complex data into clear, strategic information that supports decision-making in Zurich’s tourism sector.

1 DATA

1.1 Cleaning

1.1.1 Tourism Data - All

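# Packages used throughout the report
library(here)
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(forecast)  # ggseasonplot(), ggsubseriesplot()
library(tsibble)
library(fable)

# Load the data from data/Dataset_tourism.xlsx
tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))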

#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#print unique values in month column
unique(tourism_data$Monat)
#>  [1] "Januar"    "Februar"   "März"      "April"     "Mai"      
#>  [6] "Juni"      "Juli"      "August"    "September" "Oktober"  
#> [11] "November"  "Dezember"
# Recode the German month names into English
tourism_data$Monat <- tourism_data$Monat %>% recode_factor(
  "Januar" = "January",
  "Februar" = "February",
  "März" = "March",
  "April" = "April",
  "Mai" = "May",
  "Juni" = "June",
  "Juli" = "July",
  "August" = "August",
  "September" = "September",
  "Oktober" = "October",
  "November" = "November",
  "Dezember" = "December"
)
#add date type column for plotting purposes
tourism_data <- tourism_data %>% mutate(Date = dmy(paste("01", Monat, Jahr)))
# Count missing values (NA) across all columns
sum(is.na(tourism_data))
#> [1] 51395
# Inspect the rows where 'value' is missing
(tourism_data %>% filter(is.na(value)))
#> # A tibble: 51,395 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Schwe~ Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Schwe~ Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Schwe~ Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Schwe~ Janu~ 2005     NA 2005-01-01
#> # i 51,385 more rows
# Show the data with reactable, limited to the first 1,000 rows
reactable::reactable(head(tourism_data, 1000))

1.1.2 Tourism Data - Zurich

#filter column 'Kanton' for Zurich
tourism_data_zurich <- tourism_data %>% filter(Kanton == "Zürich")
# Count missing values (NA) across all columns
sum(is.na(tourism_data_zurich))
#> [1] 1869
# Inspect the rows where 'value' is missing
tourism_data_zurich %>% filter(is.na(value))
#> # A tibble: 1,869 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Zürich Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Zürich Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Zürich Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Zürich Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Zürich Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Zürich Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Zürich Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Zürich Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Zürich Janu~ 2005     NA 2005-01-01
#> # i 1,859 more rows

#show the data in a table using reactable
reactable::reactable(head(tourism_data_zurich, 1000))

1.1.3 Tourism Data - Zurich and Philippines

tourism_data_zurich_philippines <- tourism_data_zurich %>% filter(Herkunftsland == "Philippinen")
#show table using reactable
reactable::reactable(tourism_data_zurich_philippines)

1.1.4 Dealing with missing values (NA)

The data filtered for Zurich and the Philippines contains no missing values. If it did, we would proceed as follows:

1.1.4.1 Impute missing values with ARIMA

If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data.

# # Create a monthly tsibble (one series per canton and country of origin)
# # and turn implicit gaps into explicit NA rows
# data <- tourism_data_zurich_philippines %>%
#   mutate(Month = yearmonth(Date)) %>%
#   as_tsibble(index = Month, key = c(Kanton, Herkunftsland)) %>%
#   fill_gaps()
# 
# # Fit an ARIMA model to the data with missing values
# model_fit <- data %>%
#   model(ARIMA(value))
# 
# # Interpolate the missing values using the fitted ARIMA model
# filled_data <- model_fit %>%
#   interpolate(data)
# 
# # Print the data with the missing values filled in
# print(filled_data)

2 EDA - Zurich

2.1 Zurich and All visiting countries

# Prepare the data: total trips per country of origin and month.
# Note: domestic visitors ("Schweiz") are not removed here; add
# filter(Herkunftsland != "Schweiz") if they should be excluded.
data <- tourism_data_zurich %>%
  filter(!is.na(value)) %>%  # Drop rows with a missing 'value'
  group_by(Herkunftsland, Date) %>%  # Group by country and date
  summarise(Trips = sum(value), .groups = "drop")  # Sum trips per country and date

p <- ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,
                      color = Herkunftsland == "Philippinen",
                      text = paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "grey")) +
  labs(title = "Number of Trips from Each Country to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = "text")

# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(margin = list(l = 60, r = 60, b = 60, t = 80),  # Adjust margins
         legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))  # Adjust legend position

# Display the interactive plot
interactive_plot

2.2 Zurich and Philippines Visitors

# use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axis
p <- ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +
  geom_line() +
  labs(title = "Number of Trips from the Philippines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()
p

2.2.1 Pattern

2.2.1.1 Decompose

# Convert the data to a monthly time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure the data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values with 0 if there are any
  # Create the ts object; start is given as c(year, month) so that the
  # periods align exactly with calendar months
  with(ts(value, frequency = 12, start = c(year(min(Date)), month(min(Date)))))

# Decompose the time series
decomposed <- decompose(tourism_ts)

# Plot the decomposed components
plot(decomposed)

2.2.1.2 Seasonality

# Plot the seasonality in one chart 
ggseasonplot(tourism_ts, year.labels = TRUE, year.labels.left = TRUE)

# One panel per month to compare the seasonal pattern across years
ggsubseriesplot(tourism_ts)

3 MODELLING

This part builds on time series techniques to model the data. Several models can be investigated, but each choice should be justified in the report. Pay attention to the conditions required to apply a specific model, and treat seasonality, outliers, collinearity, covariates, special events, etc. with care. The steps are: (a) aggregation choice for hierarchical time series, (b) model building, (c) model selection.

3.1 Outliers

3.2 Correlation

3.3 Special Event

3.3.1 Covid impact

How should we deal with this black swan event?

  • Incorporating dummy variables: introduce dummy variables into the forecasting models to account for the impact of COVID-19. These variables are set to 1 for the periods affected by the pandemic and 0 otherwise, which lets the model separate the impact of COVID-19 from normal variation in the data.
  • Replacing the COVID-period values
# Analyse the impact of COVID on the data from November 2019 to January 2023
covid_impact <- tourism_data_zurich_philippines %>%
  filter(Date >= as.Date("2019-11-01") & Date <= as.Date("2023-01-01")) %>%
  ggplot(aes(x = Date, y = value)) +
  geom_line() +
  labs(title = "Impact of COVID on Tourism from the Philippines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()
covid_impact
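
The dummy-variable approach described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the series is simulated (it is not the Zurich data), the pandemic window of March 2020 to December 2021 is an arbitrary choice, and the names `covid`, `trips`, and `fit_dummy` are hypothetical. A plain linear regression with trend and monthly seasonality stands in for the forecasting model.

```r
# Simulated monthly series (hypothetical data, NOT the Zurich figures)
set.seed(42)
dates <- seq(as.Date("2015-01-01"), as.Date("2023-12-01"), by = "month")
n <- length(dates)
trend <- seq_len(n)
month_num <- as.integer(format(dates, "%m"))
month_f <- factor(month_num)  # monthly seasonality as a factor

# Dummy variable: 1 inside the assumed pandemic window, 0 otherwise
covid <- as.integer(dates >= as.Date("2020-03-01") &
                      dates <= as.Date("2021-12-01"))

# Simulated trips: level + trend + seasonality - COVID drop + noise
trips <- 200 + 0.5 * trend + 20 * sin(2 * pi * month_num / 12) -
  150 * covid + rnorm(n, sd = 10)

# Regression with trend, seasonal dummies and the COVID dummy
fit_dummy <- lm(trips ~ trend + month_f + covid)

# Estimated COVID effect (the simulated drop is -150)
covid_effect <- unname(coef(fit_dummy)["covid"])
covid_effect
```

With the real data, the same 0/1 column could be added to the tsibble and placed on the right-hand side of a model formula, for example `ARIMA(value ~ covid)`. When forecasting beyond the pandemic, the dummy would be set to 0 for the future periods.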

How to deal withc

3.4 Exponential Smoothing

3.4.1 Model ETS 1

# Convert the ts object to a tsibble (columns: index, value)
tourism_ts <- tourism_ts %>% as_tsibble()
# Fit an ETS model; the components should match the characteristics of the data.
# Here error("A") is additive error, trend("A") is additive trend and
# season("A") is additive seasonality; trend("N") or season("N") would
# drop the respective component.
fit <- tourism_ts %>%
  model(ETS = ETS(value ~ error("A") + trend("A") + season("A")))
# Forecast the next 6 months
ets_forecast <- fit %>%
  forecast(h = 6)
# Plot the forecasts along with the historical data
ets_plot <- ets_forecast %>%
  autoplot(tourism_ts)

ets_plot

3.5 ARIMA

Do we need to difference the data?

# First-difference the series for inspection only; ARIMA() below selects
# the order of differencing itself when it is not specified
tourism_diff <- tourism_ts %>% mutate(diff_value = difference(value))
# Fit an ARIMA model with automatic order selection
arima_fit <- tourism_ts %>%
  model(ARIMA = ARIMA(value))
# Forecast the next 6 months
arima_forecast <- arima_fit %>%
  forecast(h = 6)
# Plot the forecasts along with the historical data
arima_plot <- arima_forecast %>%
  autoplot(tourism_ts)
arima_plot
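
One way to approach the differencing question is a unit-root test. The sketch below uses the Phillips-Perron test from base R (`PP.test`) on a simulated random walk with drift. The data is hypothetical and does not come from the Zurich series, so the output says nothing about our data; it only shows the mechanics. The null hypothesis is that the series has a unit root, so a large p-value on the level suggests differencing, and a small p-value after differencing suggests one difference is enough.

```r
# Hypothetical non-stationary series: a random walk with drift
set.seed(1)
x <- ts(cumsum(rnorm(180, mean = 0.5)), frequency = 12, start = c(2005, 1))

# Phillips-Perron test on the level and on the first difference
# (H0: the series has a unit root, i.e. it is non-stationary)
p_level <- PP.test(x)$p.value
p_diff <- PP.test(diff(x))$p.value

# Compare the two p-values
c(level = p_level, differenced = p_diff)
```

To our understanding, `ARIMA()` from fable chooses the differencing orders automatically when they are not given in the formula, so a test like this serves mainly as a diagnostic check.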

3.6 Combining Forecasts

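# Fit the ETS and ARIMA models side by side and add their simple average
# as a combination model
combined_fit <- tourism_ts %>%
  model(
    ETS = ETS(value ~ error("A") + trend("A") + season("A")),
    ARIMA = ARIMA(value)
  ) %>%
  mutate(Combined = (ETS + ARIMA) / 2)
# Forecast the next 6 months with all three models
combined_forecast <- combined_fit %>%
  forecast(h = 6)
# Plot the forecasts along with the historical data
combined_plot <- combined_forecast %>%
  autoplot(tourism_ts)
combined_plot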